-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor pandas save load and convert dtypes #3412
Refactor pandas save load and convert dtypes #3412
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I haven't fully tested zarr yet I want to make sure. We have an appropriate pandas warning somewhere for users so they know they need pandas for these features. I know we have a warning for qualitymetrics do we have one for templatemetrics?
@@ -287,7 +287,7 @@ def _compute_metrics(self, sorting_analyzer, unit_ids=None, verbose=False, **job | |||
warnings.warn(f"Error computing metric {metric_name} for unit {unit_id}: {e}") | |||
value = np.nan | |||
template_metrics.at[index, metric_name] = value | |||
return template_metrics | |||
return template_metrics.convert_dtypes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a weak recommendation ,but maybe we put this on its own line with a comment. Just from reading this I have no clue why we need to do this and doing this in the return line is even more confusing. So something like
# see xx
template_metrics.convert_dtypes()
return template_metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better?
@@ -185,7 +185,7 @@ def _compute_metrics(self, sorting_analyzer, unit_ids=None, verbose=False, **job | |||
if len(empty_unit_ids) > 0: | |||
metrics.loc[empty_unit_ids] = np.nan | |||
|
|||
return metrics | |||
return metrics.convert_dtypes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. From the code it is not clear why we need to convert dtypes so I would refer to divide this into a convert step and then only return the converted. That way we can have a comment explaining why we need to convert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment and convert step
We don't have warnings anywhere. If a user tries to compute template or quality metrics without pandas, it will throw an interpetable |
One last comment: for analyzers saved to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good to me!
warnings.warn( | ||
"The zarr store was not properly consolidated prior to v0.101.1. " | ||
"This may lead to unexpected behavior in loading extensions. " | ||
"Please consider re-saving the SortingAnalyzer object." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with a save_as(format='zarr')
? I just want to make sure the error is as clear as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really..since the problem is consolidation, the save as may fail to discover all the pieces of the folder. I changed it to re-generating
.
Honestly, I don't think this will be an issue since it will only happen if:
- you saved to zarr between 0.101.0 and 0.101.1
- you don't have write access to the data
|
||
# we use the convert_dtypes to convert the columns to the most appropriate dtype and avoid object columns | ||
# (in case of NaN values) | ||
template_metrics = template_metrics.convert_dtypes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I think that's great!
Thanks Alessio! |
We found out that zarr consolidation doens't seem to play well with our way of saving/loading dataframes to zarr, using xarray.
xarray
was only used to save/load pandas dataframes to zarr, and this PR modifies that by saving each column and the index directly. This is similar to how xarray saves to zarr, so it should be backcompatible (testing now).To make sure we don't run in such problems in the future, I added a roundtrip test in the common extension tests that asserts that data reloaded is the same as the original ones.
As sugegsted by @h-mayorquin here #3365, the generated and reloaded dataframes are also converted to numpy dtypes with the
convert_dtypes
function. We just have to make sure to call aSeries.to_numpy
to cast pandas dtypes to numpy ones.